Abstract

This paper presents a syntactic lexicon for English that was originally derived from the 
Oxford Advanced Learner's Dictionary and the Oxford Dictionary of Current Idiomatic 
English, and then modified and augmented by hand. There are more than 37,000 syn- 
tactic entries from all 8 parts of speech. An X-windows based tool is available for main- 
taining the lexicon and performing searches. C and Lisp hooks are also available so that 
the lexicon can be easily utilized by parsers and other programs. 



1 Introduction 

One of the central needs of any wide-coverage 
parser is a large lexicon that contains the syn- 
tactic information for various lexical items. 
The creation of such a lexicon has tradition- 
ally been a very large and daunting task and 
most universities have shied away from it, leav- 
ing the creation of wide-coverage parsers to 
commercial institutions that could afford the 
time and personnel to devote to the creation of 
such a lexicon. The release of several machine- 
readable dictionaries (MRDs) into the public 
domain has opened new possibilities to gram- 
mar developers at research institutions, but 
the task did not become trivial. The problem 
of creating large scale lexicons changed from 
the tiresome, painstaking task of trying to de- 
velop individual word lists for various syntactic 
phenomena to the task of 'simply' extracting 
the information from the on-line dictionaries. 
This, however, has not turned out to be as sim- 
ple or straight-forward as researchers may have 
hoped. Machine readable dictionaries present
numerous problems in terms of errors and
inconsistencies in the various components of the
lexical entries, making extraction quite difficult.
Many researchers abandon the extraction
process altogether because it consumes too
many scarce resources.

Although a number of researchers have ex- 
tracted information out of the various dictio- 
naries available, the resulting lexicons have 
not, in general, been made freely available 
to the NLP research community. In at least
some cases ([Carroll and Grover, 1989],
[Guthrie et al., 1993]) this is due to licensing
restrictions on the source dictionaries. In re- 
sponse to the related problems of duplication of 
effort and non-availability of needed lexicons, 
there are currently several on-going projects to 
create syntactic lexicons and make them gen- 
erally available. 

• The Proteus Project at New York University
is developing the Comlex Syntactic Dictionary
from scratch for release as one of the lexical
resources in COMLEX (available through the
Linguistic Data Consortium)
[Macleod et al., 1994].

• The IITLEX project at Illinois Institute of
Technology aims to extract and release the
information in the Collins English Dictionary,
along with information from various other word
lists that will include both syntactic and
semantic information. That system is still
under development, however, and currently uses
an expensive relational database package, a
drawback which they plan to correct
[Conlon, 1994].

*Currently at SRA, Arlington, VA, 22201 USA;
martinp@sra.com



The syntactic lexicon described here contains
approximately 37,000 entries extracted from the
Oxford Advanced Learner's Dictionary of Current
English [Hornby, 1974] and the Oxford Dictionary
of Current Idiomatic English
[Cowie and Mackin, 1975]. It is available via FTP
in both an ASCII and a database format. The
database format uses a UNIX hash table facility
[Seltzer and Yigit, 1991] that is freely
distributed, and comes with an X-windows based
interface for modifying the database and doing
searches. C and Lisp hooks to allow other
programs to use the database are also included.

2 Syntactic Lexicon 

The syntactic lexicon has entries for 8 part- 
of-speech categories: Adjective, Adverb, Com- 
plementizer, Conjunction, Determiner, Noun, 
Preposition, and Verb. Each entry consists of 
the following required and optional fields: 

• INDEX field (required) - the uninflected
form under which the lexical item is compiled
in the database;

• ENTRY field (required) - contains all of the
lexical items associated with the INDEX¹;

• POS field (required) - gives the part-of-speech
for the lexical item(s) in the entry field;

• FRAME field (required) - contains the syntactic
information about that entry;

• FS field (optional) - the Feature Structure
field may provide additional information
about the frame field;

• EX field (optional) - may be used for any
number of example sentences.

Note that lexical items may have more than
one entry in the database (e.g. have) and that
they may select the same frame field more
than once, using the FS to capture lexical
idiosyncrasies (e.g. map). Table 1 shows selected
entries from the database.
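As a rough illustration of the record layout described above, one entry could be modeled as follows. This is only a sketch in Python, not the representation used by the distributed C and Lisp hooks; the class name LexEntry and its attribute names are assumptions chosen to mirror the six field names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexEntry:
    """One syntactic-lexicon record (hypothetical in-memory layout)."""
    index: str          # INDEX: uninflected form used as the lookup key
    entry: List[str]    # ENTRY: the lexical item(s), e.g. ["map", "out"]
    pos: List[str]      # POS: part(s) of speech for the item(s) in ENTRY
    frames: List[str]   # FRAME: one or more syntactic frames
    fs: List[str] = field(default_factory=list)        # FS: optional features
    examples: List[str] = field(default_factory=list)  # EX: optional sentences


# The two noun entries for "map" from Table 1: same index, different frames.
map_with_det = LexEntry(
    index="map", entry=["map"], pos=["Noun"],
    frames=["Base_Noun", "Noun_Determiner_required", "Noun_Modifier"],
    fs=["wh-", "reflexive-"],
)
map_without_det = LexEntry(
    index="map", entry=["map"], pos=["Noun"],
    frames=["Noun_Determiner_not_required"],
    fs=["wh-", "reflexive-", "plural"],
)
```

Keeping FS and EX as empty lists by default reflects that both fields are optional, while the four required fields must always be supplied.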





INDEX:  have
ENTRY:  have
POS:    Verb
FRAME:  Auxiliary_Verb
FS:     Goes_on_Infinitive
EX:     John has to go to the store.

INDEX:  have
ENTRY:  have
POS:    Verb
FRAME:  Transitive_Verb
FS:     Non-Ergative
EX:     John has a problem.

INDEX:  map
ENTRY:  map out
POS:    Verb Verb_Particle
FRAME:  Transitive_Verb_Particle

INDEX:  map
ENTRY:  map
POS:    Noun
FRAME:  Base_Noun
        Noun_Determiner_required
        Noun_Modifier
FS:     wh-, reflexive-

INDEX:  map
ENTRY:  map
POS:    Noun
FRAME:  Noun_Determiner_not_required
FS:     wh-, reflexive-, plural

Table 1: Selected Syntactic Database Entries

¹ For example, a verb particle construction would be
INDEXed under the verb, but would contain both the
verb and the verb particle in the ENTRY field.

Because the syntactic database is part of the
XTAG project [Doran et al., 1994], an on-going
project to develop a wide-coverage parser for
English (see Section 7), some entries in the
syntactic lexicon reflect specific XTAG analyses.
In fact, the graphical interface for the syntactic
lexicon (described in Section 4) can run in
two modes - xtag and verbose. The tables of
example entries in this paper were all generated
in verbose mode.

The vast majority of lexical items in the 
database fall into just 3 categories - Adjectives, 
Nouns, and Verbs. These three categories plus 
Adverbs are presented in more detail in the fol- 
lowing subsections. 

2.1 Adjectives 

There are 3,303 lexical adjectives in the 
database, of which 80 are 'Proper Name' adjec- 
tives, such as Chinese and American. Adjec- 
tives have 5 frames that they can select, which 
are listed below. Possible values for the FS field
are wh- and wh+.

• Base adjective: All adjectives. 

• Modifying adjective: Adjectives that 
can occur in direct modification contexts. 
Ex. the Chinese man. 

• Predicative adjective: Adjectives that 
can occur as the complement of a predica- 
tive verb. Ex. John was happy. 

• Predicative adjective w/ sentential 
complement: Adjectives that can occur 
as the complement of a predicative verb 
and that take a sentential complement. 
Ex. John was happy that Mary left Bill. 

• Predicative adjective w/ sentential 
subject: Adjectives that can occur as the 
complement of a predicative verb and that 
take a sentential subject. Ex. That John
loves Mary is great.

2.2 Nouns 

Nouns are by far the largest category in the
syntactic database, accounting for well over
50% of the entries. Proper nouns and pronouns
both have the part-of-speech Noun. Proper
names, such as Danielle and Nicholas, are
not well-represented in the database, but
geographic names, particularly places in England,
generally are². The frames for nouns are similar
in many ways to the frames for adjectives,
since nouns can modify other nouns and occur
in predicative sentences. Other frames provide
information about the use of the noun with
determiners when forming noun phrases. The
frames for nouns are presented below:

• Base noun: All nouns.

• Noun Phrase with Determiner: Nouns that
can take a determiner when forming a noun
phrase. Ex. a man; *a jealousy

• Noun Phrase without Determiner: Nouns
that can appear without a determiner when
forming a noun phrase. Ex. envy; *plant

• Modifying noun: Nouns that can modify
other nouns. Note that not all nouns
can modify other nouns. Proper nouns in
general cannot modify other nouns, and
specific lexical items may be restricted as
well. Ex. basketball game; *John car

• Noun with sentential complement: Nouns
that take sentential complements.
Ex. the fact that Mary loves John...

• Predicative noun: Nouns that can occur
as the complement of a predicative verb.
Ex. John was a man.

• Predicative noun w/ sentential subject:
Nouns that can occur as the complement of
a predicative verb and that take a sentential
subject. Ex. That John loves Mary is a crime.

² This reflects the origin of the dictionary from which
the lexicon was originally extracted.

Because this lexicon is used in the XTAG
system, the lexicon often indicates precise
syntactic behavior, rather than simply placing a
general label on a lexical item. For the class
of nouns, this is seen in the specification of
nouns with respect to their co-occurrence with
determiners. Instead of assigning a general
label such as 'common noun' or 'mass noun', the
noun frames explicitly indicate whether certain
forms of the noun can appear with or without
a determiner. However, since the syntactic
database is indexed on root forms only, the
morphology of the lexical item is not available.
Instead, the FS field is used to indicate
any restrictions on a particular use of a lexical
item. For example, in Table 1, the noun map
occurs twice. The first time that it appears,
it selects the Noun_Determiner_required
frame. The feature structure associated with
it indicates only that the noun is not a
wh-word, and that it is not reflexive. No
restrictions are made with respect to its
morphology. In contrast, the second entry, which
selects the Noun_Determiner_not_required
frame, has plural as part of its FS. This indicates
that the noun for this frame is restricted to its
plural form. Hence map can only occur with
a determiner, but maps is free to occur either
with or without one. Nouns that belong to the
class of so-called 'mass nouns' would not have
the plural restriction on the entry that selects
the Noun_Determiner_not_required frame,
thereby indicating that the singular form is also
allowed to occur without a determiner.
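The interaction between the determiner frames and the plural feature can be sketched in a few lines of Python. This is an illustration of the reasoning above rather than code from the distributed tools; the function name, the (frames, fs) pair layout, and the entry shown for envy are assumptions.

```python
def can_omit_determiner(entries, is_plural):
    """Return True if some entry licenses the noun without a determiner.

    `entries` is a list of (frames, fs) pairs for one noun index;
    `is_plural` says whether the surface form being tested is plural.
    """
    for frames, fs in entries:
        if "Noun_Determiner_not_required" not in frames:
            continue
        # A "plural" feature restricts this entry to the plural form.
        if "plural" in fs and not is_plural:
            continue
        return True
    return False


# The two noun entries for "map" from Table 1.
MAP = [
    (["Base_Noun", "Noun_Determiner_required", "Noun_Modifier"],
     ["wh-", "reflexive-"]),
    (["Noun_Determiner_not_required"],
     ["wh-", "reflexive-", "plural"]),
]
# Hypothetical mass-noun entry: no plural restriction on the frame.
ENVY = [(["Base_Noun", "Noun_Determiner_not_required"],
         ["wh-", "reflexive-"])]

print(can_omit_determiner(MAP, is_plural=False))   # singular "map"
print(can_omit_determiner(MAP, is_plural=True))    # plural "maps"
print(can_omit_determiner(ENVY, is_plural=False))  # singular "envy"
```

Under these assumptions the singular map is rejected without a determiner, while maps and envy are accepted.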

2.3 Verbs 

Verbs, with their varied subcategorization 
frames, are perhaps the most interesting lexi- 
cal items in a syntactic lexicon. There are over 
8100 verbs (not including auxiliary verbs) that 
make up almost 9000 entries in the database. 
There are 19 different frames that the verbs 
can select, including transitive, intransitive, 
sentential complement, sentential subject, verb 
particle constructions (transitive and intransi- 
tive), double objects with shifting, double ob- 
jects without shifting, and light verb construc- 
tions. 

As with the nouns, the FS field is used
to provide a more concise format for
specifying the frames for each lexical item. For
the verbs, the FS field is used to specify
the difference between ergative and
non-ergative transitive verbs, as can be seen
in the have entry in Table 1, and is also
used heavily for further differentiating the
frames for verbs that take sentential
complements. There are two frames for sentential
complements - Sentential_Complement
and NP_and_Sentential_Complement.
Either of these can occur with the feature
structures Infinitive_Complement,
Indicative_Complement, or
Predicative_Complement. This reduces the
number of values for FRAME that are necessary
to cover all of the possible lexical environments,
and also allows for easier searches across
categories. To find all the verbs that take
infinitive complements, one can simply search on
the Infinitive_Complement feature structure,
rather than having to specify each frame that
could fill this role. Table 2 shows some values for
various verbs that take sentential complements.
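The benefit of searching on the feature structure rather than on the frame can be illustrated with the want and think entries of Table 2. The sketch below is in Python and uses a plain list of tuples; the function name and tuple layout are assumptions, not part of the distributed interface.

```python
# (index, frame, feature structure) for the entries listed in Table 2.
ENTRIES = [
    ("want", "Sentential_Complement", "Infinitive_Complement"),
    ("want", "NP_and_Sentential_Complement", "Infinitive_Complement"),
    ("think", "Sentential_Complement", "Indicative_Complement"),
    ("think", "Sentential_Complement", "Infinitive_Complement"),
    ("think", "Sentential_Complement", "Predicative_Complement"),
]


def verbs_with_feature(entries, feature):
    """Return the sorted set of indexes that carry `feature` in FS."""
    return sorted({index for index, _frame, fs in entries if fs == feature})


def frames_with_feature(entries, feature):
    """Return the sorted set of frames that co-occur with `feature`."""
    return sorted({frame for _index, frame, fs in entries if fs == feature})


print(verbs_with_feature(ENTRIES, "Infinitive_Complement"))
print(frames_with_feature(ENTRIES, "Infinitive_Complement"))
```

A single query on Infinitive_Complement finds both verbs without naming either of the two frames that can carry that feature.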



INDEX:  want
ENTRY:  want
POS:    Verb
FRAME:  Sentential_Complement
FS:     Infinitive_Complement
EX:     Dan wants to finish this paper.

INDEX:  want
ENTRY:  want
POS:    Verb
FRAME:  NP_and_Sentential_Complement
FS:     Infinitive_Complement
EX:     Dan wants Al to finish this paper.

INDEX:  think
ENTRY:  think
POS:    Verb
FRAME:  Sentential_Complement
FS:     Indicative_Complement
EX:     Dan thought that the paper was done.

INDEX:  think
ENTRY:  think
POS:    Verb
FRAME:  Sentential_Complement
FS:     Infinitive_Complement
EX:     Doug thought to clean the kitchen.

INDEX:  think
ENTRY:  think
POS:    Verb
FRAME:  Sentential_Complement
FS:     Predicative_Complement
EX:     Dan thought Carl a jerk.

Table 2: Verbs with Sentential Complements



2.3.1 Auxiliary verbs 

The lexical entries for auxiliary verbs are very
closely tied to the XTAG analysis, which orders
the auxiliary verbs based on their
morphological forms. Each entry in the lexicon
is restricted via the FS field to only a certain
form of the auxiliary verb (present, past,
ppart, etc.), which also indicates which other
forms it can go on³. Table 3 shows the
entries for the auxiliary verbs for the sentence
John should have been waiting.
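One way to read the auxiliary features is as a chain in which each auxiliary names the form required of the verb that follows it. The Python sketch below checks such a chain using the feature values given in Table 3. It is an illustration under stated assumptions, not the XTAG analysis itself: the table is keyed by surface form for brevity, a Goes_on_X feature is taken to mean "the next verb must have form X", and treating waiting as a Gerund form is likewise an assumption.

```python
# surface form -> (forms of this auxiliary, form it goes on)
AUX = {
    "should": ({"Indicative", "Present"}, "Base"),
    "have": ({"Base"}, "Past_Participle"),
    "been": ({"Past_Participle"}, "Gerund"),
}


def chain_is_licensed(aux_words, main_verb_form):
    """Check that each auxiliary goes on the form of the verb after it."""
    forms = [AUX[word] for word in aux_words]
    # Treat the main verb as a final link that selects nothing further.
    forms.append(({main_verb_form}, None))
    return all(goes_on in next_form
               for (_, goes_on), (next_form, _) in zip(forms, forms[1:]))


# "John should have been waiting."
print(chain_is_licensed(["should", "have", "been"], "Gerund"))
# "*John should been waiting." fails: should goes on a Base form.
print(chain_is_licensed(["should", "been"], "Gerund"))
```

Because every restriction is local to a pair of adjacent verbs, the check needs only one pass over the sequence.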



INDEX:  should
ENTRY:  should
POS:    Verb
FRAME:  Auxiliary_Verb
FS:     Indicative, Present, Goes_on_Base

INDEX:  have
ENTRY:  have
POS:    Verb
FRAME:  Auxiliary_Verb
FS:     Base, Goes_on_Past_Participle

INDEX:  be
ENTRY:  be
POS:    Verb
FRAME:  Auxiliary_Verb
FS:     Past_Participle, Goes_on_Gerund

Table 3: Example Auxiliary Verb Entries

³ For a more detailed description of this and other
XTAG analyses, please see the XTAG Technical Report
[The XTAG Project, 1994].

2.4 Adverbs

A syntactic lexicon for adverbs is particularly
useful because adverbs are so idiosyncratic as
to where they can occur in a sentence. Although
there are only 169 adverbs in the syntactic
lexicon, there are 15 different frame
values that they can select. These include basic
adverb, pre and post verb phrases, pre and post
sentences, pre and post adjective, pre-adverb,
pre-preposition, pre-noun, etc. Table 4 shows
some selected adverb entries.

INDEX:  ahead
ENTRY:  ahead
POS:    Adverb
FRAME:  Base_Adverb
        Post-VP
        Pre-PP

INDEX:  essentially
ENTRY:  essentially
POS:    Adverb
TREES:  Base_Adverb
        Pre-VP
        Pre-S
        Post-S

INDEX:  even
ENTRY:  even
POS:    Adverb
FRAME:  Base_Adverb
        Pre-VP
        Pre-Adj
        Pre-Noun
        Pre-PP

INDEX:  very
ENTRY:  very
POS:    Adverb
FRAME:  Base_Adverb
        Pre-Adj
        Pre-Adv

Table 4: Some Adverb Lexical Entries

3 File Formats

The information in the syntactic database is
available both in an ASCII 'flat' file and in a
hashed database format. The ASCII file contains
one entry per line, and each field is clearly
marked. This format is easily usable by various
UNIX™ utilities such as grep and awk, and
it can be easily parsed by custom programs.

The hashed database format is very useful
for programs that need quick access to the
information in the database. Each entry is
indexed under the index key, and a single call
to the database for a particular index returns
all of the entries that share that index. This
makes it particularly useful for parsers. The
database uses an encoding scheme for the POS,
FRAME, and FS fields, which condenses the
space required for the database and shortens
the search time for non-index fields. All of the
entries for a given lexical index can be retrieved
in 1.6 msecs, on average.
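The relationship between the two formats can be sketched as follows: a flat file with one marked-up entry per line is read once and grouped by index, so that a single lookup returns every entry sharing that index. The Python below is a minimal sketch under stated assumptions. The paper does not give the exact field markers, so the tab-separated "NAME: value" layout is hypothetical, and an in-memory dictionary stands in for the UNIX hash table facility.

```python
from collections import defaultdict

FIELD_NAMES = ("INDEX", "ENTRY", "POS", "FRAME", "FS", "EX")


def parse_line(line):
    """Split one flat-file line into a {field name: value} record."""
    record = {}
    for chunk in line.rstrip("\n").split("\t"):
        name, _, value = chunk.partition(": ")
        if name in FIELD_NAMES:
            record[name] = value
    return record


def build_index(lines):
    """Group records by INDEX so one lookup returns all of its entries."""
    table = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if "INDEX" in record:
            table[record["INDEX"]].append(record)
    return table


# Hypothetical flat-file lines built from entries in Table 1.
FLAT_FILE = [
    "INDEX: have\tENTRY: have\tPOS: Verb\tFRAME: Transitive_Verb"
    "\tFS: Non-Ergative\tEX: John has a problem.",
    "INDEX: map\tENTRY: map\tPOS: Noun\tFRAME: Noun_Determiner_required"
    "\tFS: wh-, reflexive-",
    "INDEX: map\tENTRY: map\tPOS: Noun"
    "\tFRAME: Noun_Determiner_not_required\tFS: wh-, reflexive-, plural",
]

table = build_index(FLAT_FILE)
for record in table["map"]:
    print(record["FRAME"], "|", record["FS"])
```

Grouping by index up front is what lets a parser fetch all candidate entries for a word with one call instead of scanning the file.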
4 Interface

Although the format of the flat file is excellent
for various file utility programs, and the
database format works well for retrieving entries
quickly, neither is particularly well-suited
for human readability. The X-windows
interface⁴ for the syntactic database allows users
to easily look at the database. Searching is
available not only on the index under which
the lexical item is stored, but also on all
other fields, with the exception of the EX field.
Searches may also be done on combinations⁵
of fields. For instance, one could search on
POS = Noun and FS = wh+ to find the set
of all wh+ nouns (what, who, whom, which,
when). Figure 1 shows the interface after a
search has been done on the index need. All of
the entries with that index are listed in a scroll
window, which can be browsed through using
the Next and Previous buttons, or specific
entries can be clicked on, and the entire record
will show in the upper window. The results of
searches can be saved to a file to create smaller
'custom' lexicons. In addition to searching the
database, users can also easily add, delete and
modify individual entries, tailoring the syntactic
database to fit their needs. Users may also
delete all entries found in a given search, and
we hope to add the capacity to modify an entire
set of entries in the future.
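A combined search such as POS = Noun and FS = wh+ amounts to keeping the entries that satisfy every field constraint at once. The Python sketch below shows the idea on a toy lexicon. The function name, the dictionary layout, and the four sample records are assumptions for illustration; the real interface performs this search through its own menus.

```python
def search(entries, **criteria):
    """Return entries whose fields contain every requested value."""
    if "EX" in criteria:
        # The interface described above does not search the EX field.
        raise ValueError("searching on EX is not supported")

    def matches(entry):
        return all(value in entry.get(name, ())
                   for name, value in criteria.items())

    return [entry for entry in entries if matches(entry)]


# Toy records; multi-valued fields are stored as lists.
LEXICON = [
    {"INDEX": "what", "POS": ["Noun"], "FS": ["wh+"]},
    {"INDEX": "who", "POS": ["Noun"], "FS": ["wh+"]},
    {"INDEX": "map", "POS": ["Noun"], "FS": ["wh-", "reflexive-"]},
    {"INDEX": "want", "POS": ["Verb"], "FS": ["Infinitive_Complement"]},
]

wh_nouns = search(LEXICON, POS="Noun", FS="wh+")
print([entry["INDEX"] for entry in wh_nouns])
```

The list returned by such a query is also the natural unit to write out when building a smaller custom lexicon.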



[Figure: screenshot of the xsyn interface window,
showing record 5 of 7 for the index need]

Figure 1: Result of a search on the index need

⁴ The interface uses the MIT Athena Toolkit, which
is distributed with the standard MIT X release.

⁵ We hope to expand this in the future to include
full regular expression searches.

5 Statistics

Statistics were gathered on the coverage of
the syntactic lexicon on the IBM, ATIS,
WSJ, and Brown corpora. These corpora
were chosen because they have been tagged
and hand corrected by the TreeBank project
[Santorini, 1990]. The data in Table 5 show
the coverage of the lexicon on various corpora.
A lexical item/part-of-speech pair is counted
as a hit if the lexical item is in the syntactic
lexicon with the indicated tag. No attempt
was made to determine if the lexicon had the
correct frame needed to parse the sentence.
Because the syntactic lexicon contains only
the root form of lexical entries, the inflected
form was first looked up in the morphology
database [Karp et al., 1992] to retrieve the
root form, and then that was used for the
syntactic lexicon. Items that were not found
in the morphological database were counted
against the syntactic lexicon, as the morphology
database is a superset of the syntactic
database⁶. The statistics in Table 5 are over
all word occurrences in the corpora⁷, so words
that occur frequently are given more weight.

Corpus   Number of Hits   Total # of Words   Percent Hit
WSJ      1974528          2462557            80.18%
Brown    799904           991008             80.72%
IBM      60944            68800              88.58%
ATIS     10156            13791              73.64%

Table 5: Percentage of Hits for various corpora

⁶ Because these databases are being used in an actual
parser, an attempt was made some time ago to
ensure that all words in the syntactic lexicon appear in
the morphological database. Although the databases
may have diverged slightly since then, the difference
should not be statistically significant.

⁷ Numbers and the genitive marker ('s) were taken
out before the statistics were compiled.

         Number of  Percent   Percent  Percent  Percent  Percent
Corpus   Non-hits   Proper N  Nouns    Adj      Adv      Verbs
WSJ      488029     43.8%     30.7%    13.8%    5.7%     1.3%
Brown    191104     26.2%     40.6%    14.8%    7.4%     1.8%
IBM      7856       17.1%     56.9%    11.3%    2.8%     2.5%
ATIS     3635       67.4%     14.0%    1.6%     0.6%     2.4%

Table 6: Percentage of missing words for various Parts of Speech
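The hit-counting procedure described in this section can be sketched as follows. This is an illustrative Python reconstruction rather than the script actually used: the function names, the keying of the morphology table by (word, tag) pairs, and the toy data are all assumptions. The final lines recompute the percentages from the hit and word counts reported in Table 5.

```python
def coverage(tagged_tokens, morphology, lexicon):
    """Count token-level hits of (root, tag) pairs in the lexicon."""
    hits = 0
    for word, tag in tagged_tokens:
        root = morphology.get((word, tag))
        # Words missing from the morphology database count as misses.
        if root is not None and (root, tag) in lexicon:
            hits += 1
    return hits, len(tagged_tokens)


def percent(hits, total):
    """Hit rate as a percentage rounded to two decimal places."""
    return round(100.0 * hits / total, 2)


# Toy data: every occurrence counts, so "maps" is counted twice.
TOKENS = [("wants", "Verb"), ("maps", "Noun"), ("maps", "Noun"),
          ("Danielle", "Noun"), ("zorp", "Noun")]
MORPHOLOGY = {("wants", "Verb"): "want", ("maps", "Noun"): "map",
              ("Danielle", "Noun"): "Danielle"}
LEXICON = {("want", "Verb"), ("map", "Noun")}

hits, total = coverage(TOKENS, MORPHOLOGY, LEXICON)
print(hits, total, percent(hits, total))

# Recomputing Table 5 from its reported counts.
TABLE_5 = {"WSJ": (1974528, 2462557), "Brown": (799904, 991008),
           "IBM": (60944, 68800), "ATIS": (10156, 13791)}
for corpus, (h, t) in TABLE_5.items():
    print(corpus, percent(h, t))
```

Counting over occurrences rather than distinct words is what gives frequent words more weight in the reported figures.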

Not surprisingly, nouns and proper nouns⁸
comprise the largest category of words missed,
followed by adjectives, adverbs, and verbs.
Table 6 shows the percentage of each of these
categories in the list of items not found. Again,
this is a percentage of word occurrences in the
corpora.

As Table 6 indicates, the majority of the
missing items are either nouns or proper nouns
(66.8% - 81.4%). This is not surprising, nor
particularly distressing, as nouns tend to be
the easiest items to 'guess' information about.
Verbs, which tend to be the hardest, are
reasonably well-covered in this lexicon. The
number of adjectives not covered, however, seems
fairly high, and we plan to add a number of
those missing to the syntactic lexicon.

6 Future Work 

The lexicon in its present form does not pro- 
vide a mechanism to specify preferences of lex- 
ical items for certain syntactic structures. As 
part of future enhancements to the lexicon we 
hope to associate probabilities with each entry. 
The probabilities will reflect the affinity of the 
lexical item for the syntactic structure associ- 
ated with that entry. These probabilities will 
be computed from parsed corpora. 
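One simple way such probabilities might be estimated is by relative frequency over a parsed corpus. The paper does not specify an estimator, so the Python sketch below is only one possible reading; the function name and the observation counts are hypothetical.

```python
from collections import Counter


def frame_probabilities(observations):
    """Map (index, frame) -> P(frame | index) by relative frequency."""
    pair_counts = Counter(observations)
    index_counts = Counter(index for index, _frame in observations)
    return {(index, frame): count / index_counts[index]
            for (index, frame), count in pair_counts.items()}


# Hypothetical (index, frame) pairs read off a parsed corpus.
OBSERVED = (
    [("want", "Sentential_Complement")] * 3
    + [("want", "NP_and_Sentential_Complement")] * 1
)

probabilities = frame_probabilities(OBSERVED)
for (index, frame), p in sorted(probabilities.items()):
    print(index, frame, p)
```

Each probability is conditioned on the index, so the values for all frames of one lexical item sum to one.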

It has been observed quite conclusively in
recent work in lexicography that certain
combinations of words co-occur more often than
would be expected if they corresponded to
arbitrary usages of the individual words.
Collocational information has been shown to be of
immense use in pruning the search space for a
parser. We hope to eventually extract
collocational information from the corpora and make
it a part of the syntactic lexicon.

7 Related Work 

The syntactic lexicon was developed as part
of the XTAG project [Doran et al., 1994] at
the University of Pennsylvania under the
direction of Dr. Aravind Joshi. The XTAG system
is a wide-coverage parser and grammar for
English based on the Tree Adjoining Grammar
(TAG) formalism [Joshi et al., 1975]. The
English grammar consists of 3 sections - a
morphology database, a syntactic database,
and a tree grammar. Together with a
parser and an X-windows interface, they
comprise the XTAG system. Both the
morphology [Karp et al., 1992] and syntactic
databases are available separately. The entire
XTAG system is also freely available to
the NLP research community. Information
about the entire XTAG system and FTP
instructions may be obtained by writing
xtag-request@linc.cis.upenn.edu.

⁸ Although we do not distinguish nouns and proper
nouns in the syntactic lexicon, the TreeBank tags do
make this distinction, and it seemed useful to continue
this distinction for this part of the analysis.

8 Computer Platform 

The syntactic lexicon and accompanying inter- 
face were developed on the Sun SPARC station 
series, as were the other tools mentioned in
Section 7. All of the XTAG tools, including the
syntactic lexicon and interface, are freely avail- 
able without limitation through anonymous 
FTP to ftp.cis.upenn.edu. The syntactic 
lexicon and accompanying programs together 
require about 9MB of space (for both the 
ASCII and DB versions of the lexicon). Please 
send mail to lex-request@linc.cis.upenn.edu for 
current FTP instructions or for more informa- 
tion. 